ABSTRACT

The concept of orthology provides a foundation for 
formulating hypotheses on gene and genome evolu- 
tion, and thus forms the cornerstone of comparative 
genomics, phylogenomics and metagenomics. We 
present the update of OrthoDB, the hierarchical
catalog of orthologs (http://www.orthodb.org). From
its conception, OrthoDB promoted delineation of 
orthologs at varying resolution by explicitly referring 
to the hierarchy of species radiations, now also 
adopted by other resources. The current release 
provides comprehensive coverage of animals and 
fungi representing 252 eukaryotic species, and is 
now extended to prokaryotes with the inclusion of 
1115 bacteria. Functional annotations of orthologous 
groups are provided through mapping to InterPro, 
GO, OMIM and model organism phenotypes, with 
cross-references to major resources including 
UniProt, NCBI and FlyBase. Uniquely, OrthoDB 
provides computed evolutionary traits of orthologs, 
such as gene duplicability and loss profiles, divergence
rates and sibling groups, now extended with
exon-intron architectures, syntenic orthologs and
parent-child trees. The interactive web interface
allows navigation along the species phylogenies,
complex queries with various identifiers, annotation 
keywords and phrases, as well as with gene copy- 
number profiles and sequence homology searches. 
With the explosive growth of available data, 
OrthoDB also provides mapping of newly sequenced 
genomes and transcriptomes to the current 
orthologous groups. 

INTRODUCTION 

Homology in molecular biology refers to a common 
ancestry. In practice, homologous genes are recognized
through the assessment of the statistical significance of
sequence similarities of aligned nucleotides or amino 
acids. With reference to a specific species radiation, hom- 
ologous relations define orthologs — 'equivalent' genes in 
different species descended from a single ancestral gene 
(1-3). Speciation events, gene duplications, losses and 
sequence mutations lead to the diversity of genes 
encoded in the genomes of modern species. For any 
given set of species, all the descendants of a single gene 
from their last common ancestor constitute an 
orthologous group of genes. Orthology is therefore inher- 
ently hierarchical, referring explicitly to the last common 
ancestor, such that mostly one-to-one orthologs are 
identified among closely related species, whereas among 
more distantly related species orthologous groups 
comprise all surviving descendants of the ancestral gene. 

There are two main approaches for orthology delinea- 
tion: (i) algorithms that cluster all-against-all pairwise 
sequence comparisons, usually first identifying best- 
reciprocal matches between genomes that correspond to 
the shortest path over the speciation node of a 
distance-based tree, e.g. (4-12); and (ii) phylogeny-based 
methods that first define homologous gene families, build 
gene trees for each family, and then explicitly or implicitly 
reconcile them with the species tree often employing 
assumptions on rates of gene losses and duplications, 
e.g. (13-18). Phylogeny-based approaches have more par- 
ameters and may therefore yield better accuracy given suf- 
ficient data, but are often limited by the quality of multiple 
sequence alignments. This approach also considerably in- 
creases computational demands and becomes impractical 
for hundreds of species. 

Recent benchmarking of prominent orthology resources
(19,20) shows that in the trade-off between specificity and
sensitivity, OrthoDB assignments favor greater specificity 
with reasonable sensitivity, a balance that is well-suited to 
the goal of inferring gene functions. Although orthology 
is strictly an evolutionary concept, it can support the 
tentative transfer of functional annotations from well- 
studied organisms to orthologs in newly sequenced 
species. The confidence of such hypotheses on gene
function may be qualitatively gauged by the genes' evolu- 
tionary histories, e.g. more confident inferences may be 
made for orthologs that are preserved across many 
species mostly as single-copy genes, with relatively low 
levels of sequence divergence, and consistent protein 
domain architectures. Gene duplicates in multi-copy 
orthologous groups often exhibit greater sequence diver- 
gence than single-copy orthologs (21), and as this may 
reflect biological innovation, any inferences on gene 
function should be made cautiously. OrthoDB classifica- 
tions have proved to be accurate and biologically relevant 
as assessed within the framework of several recent genome 
projects, e.g. (22-26). Thus, the evolutionary characterization
of orthologous groups in OrthoDB, collated with
available gene functional annotations, provides a strong
basis for making informed hypotheses that can drive evolutionary
and molecular biology research.



SPECIES SAMPLING 

The current OrthoDB release includes more than 250 eukaryotes
and now also extends to cover prokaryotes with a
total of 1115 bacterial species (Table 1, Supplementary
Table S1). The predicted protein-coding gene sets and
their corresponding General Feature Format (GFF) annotations
for 52 vertebrate species were retrieved from
Ensembl (27) (Release 67, May 2012). Data for the 45 
arthropods were sourced from AphidBase (28), 
BeetleBase (29), FlyBase (30), Hymenoptera Genome 
Database (31), SilkDB (32), VectorBase (33), wFleaBase 
(34) and several genome consortia (as of July 2012). Gene 
sets for an additional 13 basal animal species were 
retrieved from Ensembl Genomes (35) and the Joint 
Genome Institute (36) (as of July 2012). The 142 fungal
gene sets were retrieved from UniProt (37) (July 2012
release) and the bacteria were retrieved from NCBI (38)
(Supplementary Table S1).



HIERARCHICAL ORTHOLOGOUS GROUPS 

The OrthoDB orthology delineation procedure is based on 
clustering of best-reciprocal-hits (BRHs) between genes 
from each species pair, determined from all-against-all 
Smith-Waterman protein sequence comparisons now 
using SWIPE (39). The clustering procedure considers 
only the longest transcript per gene, and only the longest 
of all gene copies in a single genome with over 97% amino 
acid identity as determined by CD-HIT (40). Clusters are 
built progressively, with an e-value cutoff of 1e-3 for
triangulating BRHs, and 1e-6 for pair-only BRHs,
requiring an overall minimum sequence alignment 
overlap of 30 amino acids. The clusters of BRHs are sub- 
sequently further expanded to include all in-paralogs 
recognized as within-species homologs that are more 
closely related than the clustered BRHs. 
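
The core BRH step for a single species pair can be sketched as follows. This is a minimal illustration and not the OrthoDB implementation: the input layout (each gene mapped to its best-scoring hit and e-value in the other species) and the function name are assumptions, and the triangulation across three species, the alignment-overlap filter and the in-paralog expansion described above are omitted. The default cutoff of 1e-6 is the value stated for pair-only BRHs.

```python
from typing import Dict, List, Tuple

# Hypothetical input layout: gene -> (best-scoring hit in the other species, e-value).
Hits = Dict[str, Tuple[str, float]]


def best_reciprocal_hits(a_to_b: Hits, b_to_a: Hits,
                         max_evalue: float = 1e-6) -> List[Tuple[str, str]]:
    """Return gene pairs (a, b) that are each other's best hit
    and whose e-values in both directions pass the cutoff."""
    pairs = []
    for gene_a, (gene_b, evalue_ab) in a_to_b.items():
        back = b_to_a.get(gene_b)
        if back is None:
            continue  # gene_b has no recorded hit in species A
        partner, evalue_ba = back
        if partner == gene_a and max(evalue_ab, evalue_ba) <= max_evalue:
            pairs.append((gene_a, gene_b))
    return sorted(pairs)


# Toy data: a2 also hits b1, but b1 prefers a1, so only a1-b1 is reciprocal.
a_to_b = {"a1": ("b1", 1e-50), "a2": ("b1", 1e-20), "a3": ("b3", 1e-4)}
b_to_a = {"b1": ("a1", 1e-48), "b3": ("a3", 1e-4)}
print(best_reciprocal_hits(a_to_b, b_to_a))
```

With the more permissive 1e-3 cutoff that the text reserves for triangulating BRHs, the a3-b3 pair in the toy data would also be retained.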

Since its conception, OrthoDB (41) has promoted the 
concept of hierarchical orthology classifications by 
applying the clustering procedure at each radiation point 
of the considered species phylogeny and allowing users to 
explicitly select the most relevant level. It is rewarding to 
note that other resources e.g. (7,8) have embraced this 
concept and now provide orthology classifications at 
several major radiations across the tree of life. To determine
the OrthoDB hierarchy, the species phylogenies in
the current release were empirically computed using a
maximum-likelihood approach as implemented in
FastTree (42) over the super-alignment of mostly single-copy
orthologs defined at the root node, multiply-aligned
using MAFFT (43), and filtered using TrimAl (44), and
corroborated with known taxonomies from the literature.

Table 1. OrthoDB species and gene content

Lineage                      Input genes           Classified  Percentage of classified genes
  Representative species     Total      Average    genes (%)   in groups with     in groups with
                                                               annotation(s) (a)  phenotype(s) (b)

52 Vertebrates               951 245    18 293     92.7        96.3               48.4
  Homo sapiens               20 827     na         94.9        93.5               45.6
  Mus musculus               23 075     na         87.0        96.5               47.9
  Danio rerio                26 206     na         80.7        96.9               48.5
45 Arthropods                746 324    16 585     71.1        87.1               25.1
  Drosophila melanogaster    13 927     na         96.1        86.5               26.6
110 (c) Metazoa              1 974 947  17 954     81.9        93.5               60.8
  Caenorhabditis elegans     20 517     na         71.5        84.7               61.4
142 Fungi                    1 223 848  8619       85.0        86.8               49.3
  Saccharomyces cerevisiae   6652       na         96.2        91.9               94.8
1115 Bacteria                3 532 434  3168       91.0        91.6               47.1
  Escherichia coli           4149       na         97.8        97.7               98.8
  Haemophilus influenzae     1657       na         98.2        98.8               85.3
  Mycobacterium tuberculosis 3977       na         95.5        93.3               35.9

Statistics describing OrthoDB species coverage of vertebrate, arthropod, basal metazoan, fungal and bacterial orthologs with rich functional
annotations.
(a) GO terms or InterPro domains.
(b) From Online Mendelian Inheritance in Man, the Mouse Genome Database, the Zebrafish Model Organism Database, FlyBase, WormBase,
Saccharomyces Genome Database, EcoGene or the Database of Essential Genes.
(c) 13 basal metazoan species plus 52 vertebrates and 45 arthropods.

The hierarchical orthology delineation procedure of the 
sampled lineages of vertebrates, arthropods and fungi 
classified 84% of a total of 2 921 417 protein-coding
genes into 25 371, 33 393 and 55 793 orthologous groups,
respectively (Table 1). Root-level delineation across the
110 animal species defined 58 308 orthologous groups
covering 82% of the 3 198 795 metazoan genes, and clustering
of the 1115 bacteria classified 91% of the 3 532 434
bacterial genes. In addition to the root-level orthologs, 11
subgroups of bacteria, corresponding to the NCBI
taxonomy 'class' levels, were clustered to provide more
fine-grained orthologous groups for Actinobacteria,
Spirochetes, Tenericutes, Thermotogae, two classes of
Cyanobacteria and Firmicutes, and three classes of
Proteobacteria.



MAPPED FUNCTIONAL ANNOTATIONS 

As orthologous groups comprise genes descended from a 
common ancestor, functional attributes ascribed to one or 
more members can be tentatively extrapolated to the last 
common ancestor and describe the group as a whole. In 
this way, orthologous group summary annotations 
provide an overview of mapped functional attributes 
with links to respective source databases to allow further 
investigations of the putative biological roles of their 
member genes (Figure 1). 

Concise descriptors 

Gene functional descriptions sourced from UniProt (37) 
and NCBI (38) provide succinct indications of known or 
inferred biological functions with coherent nomenclatures 
based on data from the literature as well as biocurator- 
evaluated and automatic computational classifications 
and annotations. In this OrthoDB release, frequently
occurring phrases from member-gene descriptions provide
a meaningful descriptor for each
orthologous group.

Gene ontologies and InterPro domains 

Molecular function, biological process and cellular com- 
ponent Gene Ontology (GO) (45) terms were retrieved 
from UniProt (37) and InterPro (46) protein domain sig- 
natures were sourced from the UniProt Archive of 
sequences. The available functional evidence for each 
orthologous group is summarized by listing the frequen- 
cies of associated GO terms and InterPro domains with 
concise attribute descriptions. Additionally, InterPro 
matches are displayed with domains ordered sequentially 
from the N- to C-terminus, describing the complete 
domain architecture of multi-domain genes, thereby 
allowing database queries with specific domain combin- 
ations. More than 85% of orthologs from each of the 
lineages are classified in groups that can be described by 
either GO terms or InterPro domains (Table 1). 



Model organism phenotypes 

OrthoDB gene annotations are enhanced with detailed 
functional data from well-studied model organisms in 
each lineage to highlight phenotypes associated with 
genes from Mus musculus, Drosophila melanogaster and 
Saccharomyces cerevisiae, sourced from the Mouse 
Genome Database (47), FlyBase (30) and Saccharomyces 
Genome Database (48), respectively. Eukaryotic model 
organism phenotypes now also include Danio rerio from 
the Zebrafish Model Organism Database (49) and 
Caenorhabditis elegans from WormBase (50). For 
bacteria, gene annotations are extended with phenotype 
data from EcoGene (51) for Escherichia coli genes and 
from the Database of Essential Genes (52) which covers 
16 bacteria including E. coli, Haemophilus influenzae and
Mycobacterium tuberculosis (Table 1). 

Online Mendelian inheritance in man 

Human gene annotations are now enhanced with links to 
Online Mendelian Inheritance in Man (OMIM®) (53), the
catalog of associations between causative genes and 
human disease phenotypes, which describes thousands of 
allelic variants linked to numerous different disorders or 
susceptibilities. Mapping of human genes in OrthoDB to 
OMIM® records highlights known disease associations for 
almost 3000 genes (Table 1). 

COMPUTED EVOLUTIONARY ANNOTATIONS 

OrthoDB presents quantified orthologous group characteristics
that describe evolutionary properties such as gene
duplications or losses and rates of sequence divergence.
These detail their evolutionary histories and provide a
basis for the assessment of the confidence with which inferences
on gene function may be made (Figure 1).

Phyletic profiles 

Orthologous group phyletic profiles contrast the number 
of species with single-copy versus multi-copy orthologs 
and indicate the species coverage at the selected radiation 
point. The profiles thus highlight how descendant genes 
have been preserved across the phylogeny and whether 
gene duplications are widespread ('multi-copy license') 
or restricted ('single-copy control') as discussed in (21). 
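
As an illustration, a phyletic profile of this kind can be tallied from the species labels of a group's member genes. The sketch below is hypothetical: the function name, the input layout and the output keys are assumptions made for the example and do not reflect the OrthoDB code or data model.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def phyletic_profile(members: Iterable[Tuple[str, str]],
                     species: List[str]) -> Dict[str, int]:
    """Count species with single-copy, multi-copy or no orthologs.

    members: (gene_id, species) pairs for one orthologous group.
    species: all species considered at the selected radiation point.
    """
    copies = Counter(sp for _, sp in members)  # missing species count as 0
    return {
        "single_copy": sum(1 for sp in species if copies[sp] == 1),
        "multi_copy": sum(1 for sp in species if copies[sp] > 1),
        "absent": sum(1 for sp in species if copies[sp] == 0),
    }


group = [("g1", "human"), ("g2", "mouse"), ("g3", "mouse")]
print(phyletic_profile(group, ["human", "mouse", "zebrafish"]))
```

The same counts could back a profile query such as 'single-copy in >90% of species' by comparing `single_copy` with the number of species.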

Evolutionary rates 

The relative divergence among orthologous group 
member genes is quantified as the average of inter-species 
protein sequence identities normalized to the average 
identity of all inter-species BRHs. Appreciably higher or 
lower rates of divergence distinguish groups of orthologs 
with restrained or relaxed rates of protein sequence evo- 
lution, e.g. essential-gene-containing groups usually 
exhibit greater sequence conservation than those without. 
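
One plausible reading of this normalization is a simple ratio of averages, sketched below. The ratio form, the function name and the use of percent identities as input are assumptions; the text does not give the exact formula.

```python
from statistics import mean
from typing import Sequence


def relative_identity(group_identities: Sequence[float],
                      all_brh_identities: Sequence[float]) -> float:
    """Average inter-species identity within one orthologous group,
    divided by the average identity over all inter-species BRHs.

    Under this assumed definition, values above 1 indicate a group that
    is more conserved than average, values below 1 a more divergent one.
    """
    return mean(group_identities) / mean(all_brh_identities)


# A group averaging 85% identity against a background average of 60%.
print(relative_identity([80.0, 90.0], [50.0, 70.0]))
```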

Sibling groups 

Homologous relations among genes from different 
orthologous groups at a given species radiation identify 
homologous or 'sibling' orthologous groups. 

Figure 1. Screenshot of a sample orthologous group results page, featuring functional and evolutionary annotations, the inferred parent-child
tree and syntenic orthologs.

These relations are quantified using data from 
all-against-all sequence comparisons by averaging over 
all pairs of homologs that link two orthologous groups 
with an e-value cutoff of 1e-3. This allows the user to
retrieve sets of sibling orthologous groups that share sig- 
nificant sequence homology — which may therefore have 
some functional similarities — in an unbiased way that 
does not rely on protein domain or gene functional 
annotations. 

Parent-child trees 

Orthology delineation at each radiation along a given 
phylogeny hierarchically defines groups of orthologs 
with increasing resolution from the root level with the 
complete set of species to the most closely related species 
pairs. Parent-child relationships among orthologous 
groups delineated at each descendant radiation may 
therefore be defined by stepping along the phylogeny to 
identify orthologous groups with common subsets of 
genes (Figure 2). This new feature of OrthoDB represents
these relationships as parent-child trees that illustrate the
hierarchy of orthologous groups and their member genes, 
thereby building an inferred gene tree for a parent group 
by taking advantage of the greater resolution of its child 
groups. Users may view and edit the parent-child trees, as 
well as retrieve tree data formatted using Newick Utilities 
(54), from the 'Display Tree' window (Figure 1) that inte- 
grates the PhyloWidget (55) tool for the visualization and 
manipulation of phylogenetic tree data. 
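
The idea of linking groups across adjacent levels by shared genes, and serializing the result as a Newick string, can be sketched as follows. The function names, the dictionary layout and the labelling of internal nodes with group identifiers are assumptions for illustration only; the sketch covers a single parent level and one level of children.

```python
from typing import Dict, List, Set

Groups = Dict[str, Set[str]]  # hypothetical layout: group id -> member gene ids


def link_children(parents: Groups, children: Groups) -> Dict[str, List[str]]:
    """Map each parent group to the child groups that share genes with it."""
    return {
        pid: sorted(cid for cid, cgenes in children.items() if pgenes & cgenes)
        for pid, pgenes in parents.items()
    }


def to_newick(parent_id: str, links: Dict[str, List[str]], children: Groups) -> str:
    """Serialize one parent group: child groups as internal nodes, genes as leaves."""
    parts = []
    for cid in links[parent_id]:
        leaves = ",".join(sorted(children[cid]))
        parts.append(f"({leaves}){cid}")
    return f"({','.join(parts)}){parent_id};"


parents = {"P1": {"hsa1", "mmu1", "dme1"}}
children = {"C1": {"hsa1", "mmu1"}, "C2": {"dme1"}, "C3": {"other"}}
links = link_children(parents, children)
print(to_newick("P1", links, children))
```

Applying the same linking step at every descendant radiation would extend the tree to more levels.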

Gene architectures 

Evolutionary annotations now also feature summary 
tables of protein lengths (all lineages) and exon counts 
(metazoan lineages) that detail quantified mean, median
and standard deviation values for each orthologous 
group, effectively describing a 'consensus' gene architec- 
ture. Amino acid and exon counts are also listed for each 
member gene, flagging those that are significantly shorter 
or longer than the consensus as potentially inaccurate gene 
model predictions. 
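
A minimal sketch of such a summary and flagging step is given below. The text does not state how 'significantly' shorter or longer is determined, so the threshold of two standard deviations from the mean is an assumption, as are the function names and the input layout.

```python
from statistics import mean, median, pstdev
from typing import Dict, List


def length_summary(lengths: Dict[str, int]) -> Dict[str, float]:
    """Mean, median and (population) standard deviation of member lengths."""
    values = list(lengths.values())
    return {"mean": mean(values), "median": median(values), "sd": pstdev(values)}


def flag_outliers(lengths: Dict[str, int], n_sd: float = 2.0) -> List[str]:
    """Genes whose length deviates from the group mean by more than n_sd
    standard deviations (assumed criterion, not taken from the source)."""
    stats = length_summary(lengths)
    if stats["sd"] == 0:
        return []
    return sorted(gene for gene, length in lengths.items()
                  if abs(length - stats["mean"]) > n_sd * stats["sd"])


# Protein lengths in amino acids for one hypothetical orthologous group.
lengths = {"g1": 500, "g2": 510, "g3": 490, "g4": 505, "g5": 495, "g6": 100}
print(length_summary(lengths)["median"], flag_outliers(lengths))
```

The same functions would apply unchanged to exon counts.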

Figure 2. Hierarchical parent-child trees.




Syntenic orthologs 

Comparing the chromosomal arrangements of ortholo- 
gous genes among sets of species from the OrthoDB 
arthropod lineage identifies conserved blocks of syntenic 
orthologs. Such genes have maintained their local gene 
neighborhoods in the face of continual genomic evolution 
through sequence deletions, insertions and inversions, 
which may suggest selective advantages associated with 
their genomic arrangements, e.g. the TipE gene cluster 
of insect Para sodium channel auxiliary subunits (56). 
Ortholog-anchored synteny delineation (57) first identifies 
pairwise blocks with a minimum of two orthologs, 
allowing at most two intervening orthologs for each pair 
of genomes, and then successively projects these blocks 
through each pair of species across the phylogeny. The 
'OrthoBlock' viewer (Figure 1) displays the best block — 
weighted according to the evolutionary span of the species 
and the number of orthologous groups in the block — 
selected from all the resulting blocks with at least five 
species for each orthologous group. 
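
The pairwise step can be illustrated with a simple greedy scan, shown below. It assumes single-copy orthologous groups listed in chromosomal order for two genomes; the function name and the greedy chaining strategy are assumptions and not the published method (57). The subsequent projection of blocks across the phylogeny and the block weighting are not covered.

```python
from typing import List


def pairwise_blocks(order_a: List[str], order_b: List[str],
                    max_gap: int = 2, min_size: int = 2) -> List[List[str]]:
    """Chain shared orthologous groups into blocks, scanning along genome A.

    Consecutive anchors stay in one block when at most max_gap genes lie
    between them in both genomes; taking the absolute distance in genome B
    tolerates local inversions.
    """
    pos_b = {group: i for i, group in enumerate(order_b)}
    anchors = [(i, pos_b[g], g) for i, g in enumerate(order_a) if g in pos_b]

    blocks, current = [], []
    for anchor in anchors:
        if current:
            prev = current[-1]
            close_a = anchor[0] - prev[0] - 1 <= max_gap
            close_b = abs(anchor[1] - prev[1]) - 1 <= max_gap
            if not (close_a and close_b):
                if len(current) >= min_size:
                    blocks.append([g for _, _, g in current])
                current = []
        current.append(anchor)
    if len(current) >= min_size:
        blocks.append([g for _, _, g in current])
    return blocks


order_a = ["g1", "g2", "x1", "g3", "g4", "g5"]
order_b = ["g2", "g1", "g3", "y1", "y2", "y3", "y4", "g4", "g5"]
print(pairwise_blocks(order_a, order_b))
```

In the toy data, four unshared genes separate g3 from g4 in the second genome, which exceeds the allowance of two and therefore splits the anchors into two blocks.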



ORTHODB ONLINE 

Selecting any species radiation point of interest from the 
interactive species trees, users can navigate through the 
hierarchy of orthologous groups defined at each radiation 
of the eukaryotic species phylogenies and for 11 major 
bacterial clades. At each orthology level, text searches 
return results from matches to various database identifiers 
and annotation keywords or phrases that can be combined 
through logical operator syntax to build more complex 
queries (e.g. ['cytochrome c'-mitochondrial]) using 
Sphinx indexing technology (http://sphinxsearch.com/). 
In addition, database cross-referencing of gene identifiers 
enhances search term matches through available gene 
names and synonyms, InterPro, or GO identifiers, as
well as secondary identifiers from UniProt, Entrez
GeneID, RefSeq, Protein Data Bank, OMIM, PubMed
and model organism databases. Copy-number profile 
searches retrieve groups matching specific user-defined or 
general pre-defined phyletic profiles by combining the 
criteria of absent, present, single-copy, multi-copy or no 
restriction, for each species within any selected clade. 
BLAST (58) sequence similarity searches identify the 
best matches to genes from different species classified in 
OrthoDB, thereby allowing database querying with 
protein sequence data from any species. Importantly, 
although such sequence similarity searches with a single 
gene can recognize its homologs, accurate mapping to the 
defined orthologous groups requires assessment of the or- 
ganism's complete gene set (see ortholog mapping section 
below). Searches stored during each user's web browser 
session provide a query history facility to allow recently 
executed queries to be reviewed, re-run or combined, e.g. a 
profile search for 'single-copy in >90% of species' could 
be combined with a text search with the GO identifier for 
'receptor activity' to retrieve groups of mostly single-copy 
receptors. All search results may be easily exported as 
either Fasta-formatted files of protein sequences or 
tab-delimited text files of gene annotations, and the 
complete datasets are provided for download. All 
OrthoDB features are described in a comprehensive
online help page and users may contact support@orthodb.org
for additional information or specific
requests. They may also subscribe to the low-traffic
'orthodb-news' mailing list (https://list.unige.ch/mailman/listinfo/orthodb-news)
to keep abreast of the
latest developments.

OrthoDB links 

Search results present annotations for each orthologous 
group and tabulate all member genes with links to their 
respective sources e.g. Ensembl, UniProt, NCBI and 
FlyBase. Concise descriptors displayed for GO terms 
and InterPro domains are hyperlinked to their source 
records, and hyperlinks to OMIM and model organism 
databases provide direct access to all supporting data for 
genes with mapped phenotypes and synonyms. OrthoDB 
now provides FlyBase with orthology calls for the 12 
Drosophila species as well as for selected arthropods and
other animals. In addition, classified genes in OrthoDB 
are referenced with link-outs from UniProt records and 
NCBI gene link-outs. 

Mapping of new species 

Through a recently developed ortholog mapping proced- 
ure and corresponding web interfaces, OrthoDB now 
provides orthology classifications for genes from species 
with newly sequenced genomes mapped to existing ortho- 
logous groups. The mapping procedure first compares all 
genes from the new organism to all genes in OrthoDB 
groups, and then performs the BRH clustering procedure 
only allowing new genes to be added to existing clusters. 
The web interfaces list mapped genes and mirror OrthoDB 
data from the lineage(s) to which the new species is 
mapped. Thus, OrthoDB now provides online browsing 
of mapped orthologs for new species with publicly available
gene sets such as the Chinese softshell turtle,
Pelodiscus sinensis (from Ensembl Release 68) (Supplementary
Figure S1). Portals with restricted access
provide the same functionality for private gene sets from 
organisms with recently sequenced genomes. For example, 
mapping the initial gene annotations of the genome of the 
alfalfa leafcutting bee, Megachile rotundata, helped to 
assess their quality and completeness, as well as providing 
a user-friendly portal to identify orthologs from other 
insects (G. Robinson, personal communication). 

BENCHMARKING SETS OF UNIVERSAL 
SINGLE-COPY ORTHOLOGS 

The rapidly accumulating sequenced genomes and transcriptomes
vary substantially in their completeness of
sequencing, quality of read assembly and accuracy of
gene annotation. A complementary approach to technical
statistics, such as the widely used N50 measure of genome
assemblies, is to gauge the quality by examining the
coverage of an expected gene set. This approach can
assess not only completeness of genome coverage and 
fragmentation of the assembly, but also misassembly of 
haplotypes when the marker genes are known to exist only 
in single-copy, as well as the accuracy of annotation 
of such genes. For this purpose — of quality assessment 
of genomic data — we compiled benchmarking sets of 
universal single-copy orthologs (abbreviated BUSCOs) 
identified using OrthoDB for the Metazoan, Vertebrate, 
Arthropod and Fungal lineages (respectively, named 
BUSCO-Me, -Ve, -Ar, -Fu). Although these sets are in- 
tentionally conservative, they comprehensively sample 
each lineage and select representative genes from 
orthologous groups with single-copy orthologs in at least 
90% of the species. The BUSCOs are available for 
download as Fasta-formatted protein sequences with cor- 
responding gene, species and orthologous group 
identifiers. 



PERSPECTIVES 

The current OrthoDB release demonstrates the scalability 
of our computational procedures for the ab initio analysis 
of several millions of genes within a reasonable timeframe, 
e.g. with a 150 CPU-core computer cluster the total 
all-against-all sequence comparisons took about 1 month 
and the subsequent clustering procedures required from 1 
day for the arthropod set to 4 weeks for the largest 
bacteria dataset on a single machine using a 
multi-threaded algorithm. Nevertheless, its comprehensive 
application to all emerging data will become prohibitive in 
a few years due to the exponential scaling of genome 
sequencing as well as to the variable completeness and 
quality of new genome annotations. Thus, our approach 
will be to focus the complete clustering analyses on only a 
representative selection of the best annotated species and 
those that maximize phylogenetic coverage, corroborating 
the results with curated classifications. These will form a 
comprehensive set of well-annotated and trusted
orthologies to which genes from the other genomes, e.g.
the thousands of insects to be sequenced through the i5K
initiative (59), and new transcriptomes, e.g. from the
1KITE project (http://www.1kite.org), can be mapped.



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Table 1 and Supplementary Figure 1. 